First, I would like to thank the three discussants
(Glen Meeden, Joe Sedransk and Eric Slud) for con- 
structive comments on my paper and for provid- 
ing additional relevant references, particularly on 
frequentist model diagnostics (Slud) and Bayesian 
model checking (Sedransk). I totally agree with Se- 
dransk that studying alternative methods of making 
inference for finite populations is an "underserved 
field of research." I will first address the construc- 
tive comments of the discussants on the comparison 
of methods for handling sampling errors in the con- 
text of estimation with fairly large domain samples. 
Subsequently, I will respond to the discussions on 
small area estimation. 

HANSEN ET AL. EXAMPLE 

In Section 3.2, I cited the well-known Hansen, Madow
and Tepping (HMT) example illustrating the
dangers of using model-dependent methods with fair- 
ly large samples even under minor model misspecifi- 
cations. Sedransk argues in his discussion that new 
advances in model diagnostics, such as model aver- 
aging, might remedy the difficulty noted by HMT 
and provide improvements over the "straw man, the 
usual ratio estimator." I agree with Sedransk that 
it would be worthwhile analyzing this example and 
other examples to show how one can make valid 
model-dependent inferences routinely with fairly large
domain samples that can provide significant improvements
over the design-based (possibly model-assisted) methods,
particularly in the context of official statistics with
many variables of interest. If
this goal can be achieved, then I believe model- 
dependent methods (frequentist or Bayesian) will 
have significant impact on practice, similar to their 
current use in small area estimation with small domain
samples. The HMT example showed the importance of using
design weights under their design with deep stratification
by size and disproportional sample allocation. The usual
design-unbiased
weighted estimator is almost as efficient as the usual 
combined weighted ratio estimator under the HMT 
design because of deep stratification by size, so I do 
not agree with Sedransk's comment on the importance
of the ratio estimator in the HMT example. It
is interesting to note that under proportional sam- 
ple allocation, the BLUP estimator (unweighted ra- 
tio estimator) under the incorrectly specified ratio 
model is identical to the combined weighted ratio es- 
timator and hence it performs well because it is de- 
sign consistent, unlike under disproportional sample 
allocation. The HMT example demonstrated the im- 
portance of design consistency, and in fact as noted 
in Section 3.2, Little (1983) proposed restricting at- 
tention to models that hold for the sample and for 
which the corresponding BLUP estimator is design 
consistent. I have noted some limitations of this pro- 
posal in Section 3.2. It should be noted that the 
HMT illustration of the poor performance of the 
BLUP estimator used the repeated sampling design- 
based approach to evaluate confidence interval cov- 
erage. On the other hand, model-based inference is 
based on the distribution induced by the model con- 
ditional on the particular sample that has been drawn. 
However, Rao (1997) showed that the HMT conclu- 
sions still hold in the conditional framework because 
of the effective use of size information through size 
stratification. 
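
The remark about proportional allocation can be checked numerically. The sketch below is only an illustration, not a reproduction of the HMT study: the sample values, the weights and the known total `x_total` are hypothetical. It shows that the combined weighted ratio estimator and the unweighted ratio estimator coincide when all design weights are equal, and differ when the weights are unequal.

```python
def weighted_ratio_estimate(y, x, w, x_total):
    """Combined weighted ratio estimator: X * sum(w*y) / sum(w*x)."""
    num = sum(wi * yi for wi, yi in zip(w, y))
    den = sum(wi * xi for wi, xi in zip(w, x))
    return x_total * num / den


def unweighted_ratio_estimate(y, x, x_total):
    """Unweighted ratio estimator: X * sum(y) / sum(x)."""
    return x_total * sum(y) / sum(x)


# Hypothetical sample values and known population total of x.
x = [2.0, 5.0, 9.0, 20.0]
y = [3.0, 6.0, 12.0, 21.0]
x_total = 400.0

equal_w = [10.0, 10.0, 10.0, 10.0]   # equal weights, as under proportional allocation
unequal_w = [25.0, 10.0, 4.0, 1.0]   # unequal weights, as under disproportional allocation

plain = unweighted_ratio_estimate(y, x, x_total)
same = weighted_ratio_estimate(y, x, equal_w, x_total)
different = weighted_ratio_estimate(y, x, unequal_w, x_total)
```

With equal weights the common weight cancels in the ratio, so `same` equals `plain`; with unequal weights the two estimators no longer agree.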

ROLE OF DESIGN WEIGHTS 

I will now turn to Meeden's useful comments on 
the role of design weights and the use of Polya pos- 
terior (PP) for making inferences after the sample 
is observed. As noted in Section 4.2, the PP ap- 
proach when applicable permits routine interval es- 
timation for any finite population parameter of in- 
terest through simulation of many finite populations 
from PP and this general interval estimation fea- 
ture of PP is indeed attractive. Meeden notes in 
his discussion that an R package is also available 
for simulating many complete populations. However, 
so far the PP methodology considered only simple 
designs that may satisfy the assumption that the 
unsampled units are like the sampled units (exchangeability),
which limits its applicability in practice.
Meeden agrees with my comment that the PP
approach needs extension to more complex designs 
before it becomes attractive to users. Even for the 
simple designs where it is applicable, it would be 
useful to identify scenarios where the PP can per- 
form significantly better than the routine design- 
based methods in terms of confidence interval cover- 
age, especially in cases where the traditional meth- 
ods do not perform well; for example, the Woodruff 
interval on quantiles under size stratification noted 
in Section 1. Meeden notes the work of Lazar, Mee- 
den and Nelson (2008) on the constrained PP which 
incorporates known population information about 
auxiliary variables without any model assumptions 
about how the auxiliary variables are related to the 
variables of interest, similar to calibration estimation.
It appears that the constraints allowed by this
method are more flexible than those in the usual
calibration estimation (for example, a constraint that
the population median falls in some known interval), and
this feature might prove attractive to the user, especially
given the availability of an R package. However, the
constrained PP could run into problems when the 
number of population constraints is large, similar to 
traditional calibration estimation. 
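
To make the simulation idea concrete, the following is a minimal sketch of the basic (unconstrained) Polya posterior for a simple design in which exchangeability is plausible. It does not use the R package mentioned by Meeden; the function names, the sample values and the population size are assumptions made for illustration. Each simulated population is completed by Polya urn sampling from the observed values, and an interval for any parameter is read off the simulated values of that parameter.

```python
import random
import statistics


def polya_population(sample, pop_size, rng):
    """Complete the sample to pop_size units by Polya urn sampling."""
    urn = list(sample)
    while len(urn) < pop_size:
        # Draw a value from the current urn and add a copy of it.
        urn.append(rng.choice(urn))
    return urn


def polya_interval(sample, pop_size, stat, draws=1000, level=0.95, seed=0):
    """Approximate interval for stat(population) from simulated populations."""
    rng = random.Random(seed)
    values = sorted(
        stat(polya_population(sample, pop_size, rng)) for _ in range(draws)
    )
    lower = values[int((1 - level) / 2 * (draws - 1))]
    upper = values[int((1 + level) / 2 * (draws - 1))]
    return lower, upper


# Hypothetical sample of 8 units from a population of 80 units.
sample = [3.0, 5.0, 7.0, 8.0, 12.0, 15.0, 21.0, 30.0]
lower, upper = polya_interval(sample, pop_size=80, stat=statistics.median)
```

The same call works for other parameters by passing a different `stat`, which is the general interval estimation feature noted above.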

In his concluding remarks, Meeden says that one 
should not focus on estimating the variance of an estimator.
But this is customary practice because it allows
reporting the estimated coefficient of variation (CV) of
the estimator as a quality measure, and the user can
compute a confidence interval from this variance estimator
for any desired confidence level using the normal
approximation. Meeden also expresses concerns
that the frequentist practice is often "obscured by 
the prominent and unnecessary role played by the 
design weights after the sample has been selected." 
But design weights or calibration weights are needed 
for asymptotically valid design-based inferences, al- 
though it is often necessary to modify the weights to 
handle special situations, such as outlier weights. In 
fact, the PP-based estimators of a population mean 
are often close to the traditional weighted estima- 
tors, for example under stratified random sampling. 
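
The customary use of a variance estimate described above amounts to a short calculation. The sketch below assumes only a point estimate and its estimated variance are reported; the function name and the numbers are illustrative.

```python
import math
from statistics import NormalDist


def cv_and_interval(estimate, variance_estimate, level=0.95):
    """Return the estimated CV and a normal-approximation confidence interval."""
    se = math.sqrt(variance_estimate)
    cv = se / abs(estimate)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # standard normal quantile
    return cv, (estimate - z * se, estimate + z * se)


# Hypothetical published estimate and variance estimate.
cv, (ci_low, ci_high) = cv_and_interval(200.0, 100.0)
cv90, (ci_low90, ci_high90) = cv_and_interval(200.0, 100.0, level=0.90)
```

Because only the variance estimate is needed, a user can obtain an interval at any confidence level without access to the microdata.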

CALIBRATION ESTIMATORS 

Slud and I seem to agree on the limitations of 
model-dependent approaches (frequentist or Baye- 
sian) when the sample size in a domain of interest 
is sufficiently large: possible design inconsistency of 
the resulting estimators under minor model misspec- 
ifications, leading to erroneous inferences. In Sec- 
tion 3.1 I noted the popularity of model-free calibra- 
tion estimators in the large-scale production of offi- 
cial statistics from complex surveys because of their 
ability to produce common calibration weights and 
accommodate an arbitrary number of user-specified calibration
constraints. In practice, design weights are
adjusted first for unit nonresponse and then cali- 
brated to known user-specified totals. The calibra- 
tion weights are often modified to satisfy specified 
range restrictions and calibration constraints simul- 
taneously, but there is no guarantee that such mod- 
ified weights can be found. Rao and Singh (1997, 
2009) proposed a "ridge shrinkage" approach (as- 
suming complete response) to get around the latter 
problem by relaxing some calibration constraints in- 
crementally while satisfying the range restrictions. 
Slud mentions a new method he has developed re- 
cently (Slud and Thibaudeau, 2010) that can do si- 
multaneous weight adjustment for nonresponse, cal- 
ibration and weight compression. This method looks 
very interesting and his empirical results are encour- 
aging. But a solution satisfying specified range re- 
strictions on the weights may not exist and it would 
be interesting to extend the Rao-Singh approach 
to handle simultaneous nonresponse adjustment and 
calibration. 
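
For readers less familiar with calibration, here is a minimal sketch of linear (chi-squared distance) calibration to two known totals, the population size N and the total X of a single auxiliary variable. It assumes complete response and imposes no range restrictions; it is neither the Rao-Singh ridge shrinkage method nor the Slud-Thibaudeau method, and the data are hypothetical.

```python
def linear_calibration_weights(d, x, pop_size, x_total):
    """Calibrate design weights d so that sum(w) = N and sum(w*x) = X.

    Weights take the linear form w_i = d_i * (1 + lam0 + lam1 * x_i).
    """
    s0 = sum(d)
    s1 = sum(di * xi for di, xi in zip(d, x))
    s2 = sum(di * xi * xi for di, xi in zip(d, x))
    det = s0 * s2 - s1 * s1
    if det == 0:
        raise ValueError("calibration system is singular")
    r0 = pop_size - s0   # shortfall in the estimated population size
    r1 = x_total - s1    # shortfall in the estimated total of x
    lam0 = (r0 * s2 - r1 * s1) / det   # Cramer's rule for the 2x2 system
    lam1 = (s0 * r1 - s1 * r0) / det
    return [di * (1 + lam0 + lam1 * xi) for di, xi in zip(d, x)]


# Hypothetical design weights, auxiliary values and known totals.
d = [10.0, 10.0, 20.0, 20.0, 40.0]
x = [8.0, 6.0, 4.0, 3.0, 1.0]
w = linear_calibration_weights(d, x, pop_size=105.0, x_total=420.0)
```

Nothing in this linear form keeps the weights inside a prescribed range; with other data they can become negative or very large, which is why range restrictions and the possible nonexistence of a solution matter in practice.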

I agree with Slud that if the weights and calibra- 
tion totals are correctly specified, the resulting cal- 
ibration estimator is design consistent even if the 
underlying working linear regression model uses an 
incorrect or incomplete set of predictor variables, as 
in the example of Section 3.1. The effect of gross 
misspecification of the working model is on the cov- 
erage performance of the associated confidence in- 
tervals and hence it is "more subtle than design- 
consistency" as noted by Slud. Incidentally, Dorf- 
man (1994) used this example to question the con- 
tention of Hansen and Tepping (1990) that "design- 
based estimators that happen to incorporate a model 
are inferentially satisfactory, despite failure of the 
model" and concluded that the results on coverage 
for the linear regression estimator calibrated on the 
population size N and the population total X "dra- 
matically call this contention into question." Dorf- 
man's statement may be correct in regard to cali- 
bration estimators based solely on user-specified to- 
tals Z, but as noted in Section 3.1 a model-assisted 
approach based on a working model obtained after 
some model checking to eliminate gross misspecifi- 
cation of the working model can lead to good confi- 
dence interval coverage in the Dorfman example. 

ANALYSIS OF SURVEY DATA

Section 3.3 of my paper on the analysis of com- 
plex survey data is somewhat brief due to my focus 
on estimating totals and means, but I should have 
mentioned goodness-of-fit tests that take account 
of survey design. I am thankful to Slud for point- 
ing this out and making reference to my own work 
(Rao and Scott, 1984) on goodness-of-fit chi-squared 
tests for cross-classified survey data based on log- 
linear models. I might add that Roberts, Rao and 
Kumar (1987) considered goodness-of-fit tests of lo- 
gistic regression models with categorical predictor 
variables and binary response. Graubard, Korn and 
Midthune (1997) extended the well-known Hosmer 
and Lemeshow (1980) grouping method of goodness- 
of-fit for logistic regression to complex survey data. 
Roberts, Ren and Rao (2009) studied goodness-of- 
fit tests for mean specification in marginal models 
for longitudinal survey data and obtained an ad- 
justed Hosmer and Lemeshow test using Rao-Scott 
corrections as well as a quasi-score test obtained by 
extending the method of Horton et al. (1999) to sur- 
vey data. 

Multilevel models for analysis of survey data are 
more complex than the marginal models for estimat- 
ing regression parameters because of the presence of 
random effects in the models. Goodness-of-fit meth- 
ods for two-level models, when the model holds for 
the sample, are available in the literature (e.g., Pan
and Lin, 2005) but very little is known for survey 
data in the presence of sample selection bias. I am 
presently studying model-checking methods for two- 
level models taking account of the survey design. 

SMALL AREA ESTIMATION 

Turning now to small area estimation, Slud notes 
"But one serious objection is that each response va- 
riable would require its own Bayesian model" unlike 
direct calibration estimators using common weights. 
Yet model-dependent small area methods (either HB
or EB) are gaining acceptability because direct cal- 
ibration estimators are unreliable due to small sam- 
ple sizes. However, practitioners often prefer bench- 
marking the small area estimators to agree with a re- 
liable direct calibration estimator at a higher level. 
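
One simple way to carry out such benchmarking is a ratio adjustment, sketched below. This is only one of several possible benchmarking methods and is chosen here for brevity, not because the discussion prescribes it; the area estimates, the area shares and the higher-level direct estimate are hypothetical.

```python
def ratio_benchmark(model_estimates, area_shares, direct_aggregate):
    """Scale model-based area estimates so that their weighted aggregate
    equals a reliable direct estimate at the higher level."""
    aggregate = sum(s * e for s, e in zip(area_shares, model_estimates))
    factor = direct_aggregate / aggregate
    return [factor * e for e in model_estimates]


# Hypothetical model-based area means, population shares N_i / N,
# and a direct estimate of the overall mean.
model_estimates = [12.0, 8.0, 15.0]
area_shares = [0.5, 0.3, 0.2]
benchmarked = ratio_benchmark(model_estimates, area_shares, direct_aggregate=12.0)
```

Every area estimate is multiplied by the same factor, so the ranking of areas is unchanged while the aggregate agrees with the direct estimate.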

Sedransk notes that "almost all of the applica- 
tions use an area-level model" even though it makes 
strong assumptions such as known sampling vari- 
ances, as noted in Section 5. I agree with him that 
the quality of the smoothing methods used in prac- 
tice to get around the assumption of known sam- 
pling variances is questionable although smoothed 
sampling variance estimates may be satisfactory for 
point estimation. However, as noted in Section 5, 
area-level models remain attractive because the sam- 
pling design is taken into account through the direct 
estimators, and the direct estimators and the asso- 
ciated area-level covariates are more readily avail- 
able to the users than the corresponding unit-level 
sample data. Also, in using unit-level models one 
needs to ensure that the population model holds for
the sample and this could be problematic, although 
more complex methods have been proposed recently 
to handle sample selection bias in unit-level mod- 
els (Pfeffermann and Sverchkov, 2007). Neverthe- 
less, I agree with Sedransk that unit-level models 
should receive more attention in the future. 

Turning to HB model diagnostics, I have noted 
in Section 5 some difficulties with the commonly 
used posterior predictive p-value (PPP) for check- 
ing goodness-of-fit of a model because of "double 
use" of data. Alternative methods that have been 
proposed to avoid double use of data are more diffi- 
cult to implement, especially in the context of small 
area models as noted. Sedransk mentioned three ad- 
ditional references (Yan and Sedransk, 2006, 2007, 
2010) that studied alternative measures in the con- 
text of detecting unknown hierarchical structures 
under somewhat simplified assumptions. In particu- 
lar, Yan and Sedransk demonstrated that the unit- 
specific PPP-values act like uniformly distributed 
random variables under the simple mean null model 
(without random area effects) and hence a Q-Q plot
should reveal departures from the model. They as- 
sumed normality and absence of outliers in their 
study, but it would be interesting to see if their 
unit-specific P-values can in fact detect nonnormal- 
ity of random effects, studied by Sinharay and Stern 
(2003). The use of unit-specific PPP-values might 
be more attractive than using the traditional PPP- 
function because it does not require the selection of 
an appropriate checking function, but further work 
is needed including the detection of nonnormality 
as noted above. Yan and Sedransk showed that the 
PPP-function, based on the F-statistic as the check- 
ing function, is very effective for detecting hierar- 
chical structure when the true model is correctly 
guessed as the mean model with random area effects. 
This seems to imply that the PPP-function is chosen 
to reject the null model and yet Sedransk criticizes 
the frequentist goodness-of-fit tests by saying that 
"such tests are constructed to reject null hypotheses
whereas one would like to accept a postulated
model if the data are concordant with it." In the 
simulation study of Yan and Sedransk (2007) the F- 
statistic based PPP-value detected even small corre- 
lations when the sample size is large and the corre- 
sponding frequentist test would also lead to similar 
results. I do not agree with Sedransk that global fre- 
quentist goodness-of-fit tests necessarily reject the 
null model when the data are concordant with the 
model. In fact, many published papers have identi- 
fied models from real data, using frequentist tests. 
For example, Datta, Hall and Mandal (2011) devel- 
oped a frequentist model selection method by test- 
ing for the presence of small area random effects and 
applied the method to two real data sets involving 
13 and 23 areas, respectively. Their test is based on 
simple bootstrap methods and it is free of normality 
assumption. The null model in both applications is 
a regression model without random area effects and 
they showed that the frequentist p-value is as large 
as 0.2, suggesting that the data are concordant with 
the simpler null model. Slud mentioned the work 
of Jiang, Lahiri and Wu (2001) and Jiang (2001) 
on mixed linear model diagnostics in the frequentist 
framework. I personally prefer using prior-free frequentist
methods for model checking because they
can handle a variety of model deviations including 
selection of variables and random effects selection in 
linear or generalized linear mixed models (e.g., Jiang 
et al., 2008) and detection of outliers in multilevel 
models (Shi and Chen, 2008). A model selected by 
the frequentist methods can be further subjected to 
Bayesian selection methods if necessary before us- 
ing HB methods for inference. Slud notes difficulties 
with model checking in the context of SAIPE for 
sample counties where no poor children were seen. 
This is also the case for counties or areas not sam- 
pled. Model checking in those cases is indeed chal- 
lenging. 

Finally, Slud makes an important observation on 
goodness-of-fit tests when the primary interest is 
prediction: "excellent predictions can be provided 
through estimating models which are too simple to 
pass goodness-of-fit checks." Slud notes that this ob- 
servation "has not yet been formulated with mathe- 
matical care" and that both frequentists and Baye- 
sians will benefit by characterizing "which target pa- 
rameters and which combinations of true and over- 
simplified models could work in this way." In this 
context, the recent work of Jiang, Nguyen and Rao 
(2011) on best predictive small area estimation is 
relevant. This paper develops a new prediction pro- 
cedure, called observed best prediction (OBP), and 
shows that it can significantly outperform the tra- 
ditional EBLUP. 